Abstract. We exhibit some simple gadgets useful in designing shallow
parallel circuits for quantum algorithms. We prove that any quantum 
circuit composed entirely of controlled-not gates or of diagonal gates can 
be parallelized to logarithmic depth, while circuits composed of both 
cannot. Finally, while we note the Quantum Fourier Transform can be 
parallelized to linear depth, we exhibit a simple quantum circuit related 
to it that we believe cannot be parallelized to less than linear depth, and 
therefore might be used to prove that QNC < QP. 



Much of computational complexity theory has focused on the question of
what problems can be solved in polynomial time. Shor's quantum factoring
algorithm [8] suggests that quantum computers might be more powerful than
classical computers in this regard, i.e. that QBP might be a larger class than P,
or rather BPP, the class of problems solvable in polynomial time by a classical
probabilistic Turing machine with bounded error.

More recently, a distinction has been made between P and the class NC of
efficient parallel computation, namely the subset of P of problems which can
be solved by a parallel computer with a polynomial number of processors in
polylogarithmic time, i.e. $O(\log^k n)$ time for some k, where n is the number of
bits of the input [7]. Equivalently, NC problems are those solvable by Boolean
circuits with a polynomial number of gates and polylogarithmic depth.

This distinction seems especially relevant for quantum computers, where
decoherence makes it difficult to do more than a limited number of computation
steps reliably. Since decoherence due to storage errors is essentially a function 
of time, we can avoid it by doing as many of our quantum operations at once 
as possible; if we can parallelize our computation to logarithmic depth, we can 
solve exponentially larger problems. (Gate errors, on the other hand, will not 
be improved by parallelization, and may even get worse if the parallel algorithm 
involves more gates.) 

We define quantum operators and quantum circuits as follows: 

Definition 1. A quantum operator on n qubits is a unitary rank-2n tensor U
where $U_{a_1 a_2 \cdots a_n}^{b_1 b_2 \cdots b_n}$ is the amplitude of the incoming and outgoing truth values being
$a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$ respectively, with $a_i, b_i \in \{0, 1\}$ for all i. However,
we will usually write U as a $2^n \times 2^n$ unitary matrix $U_{ab}$ where the binary
representations of a and b are $a_1 a_2 \cdots a_n$ and $b_1 b_2 \cdots b_n$ respectively.



A one-layer circuit consists of the tensor product of one- and two-qubit gates, 
i.e. rank 2 and 4 tensors, or 2 x 2 and 4x4 unitary matrices. This is an operator 
that can be carried out by a set of simultaneous one-qubit and two-qubit gates, 
where each qubit interacts with at most one gate. 

A quantum circuit of depth k is a quantum operator written as the product 
of k one-layer circuits. The depth of a quantum operator is the depth of the 
shallowest circuit equal to it. 

Here we are allowing arbitrary two-qubit gates. If we like, we can restrict
[...] the input and target qubit respectively, even though they don't really leave the
input qubit unchanged, since they entangle it with the target.

Since either of these can be combined with one-qubit gates to simulate
arbitrary two-qubit gates [1], these restrictions would just multiply our definition
of depth by a constant. The same is true if we wish to allow gates that couple
k > 2 qubits as long as k is fixed, since any k-qubit gate can be simulated by
some constant number of two-qubit gates.

Fig. 1. Our notation for controlled-not, controlled-U, symmetric phase shift, and
arbitrary diagonal gates.



In order to design a shallow parallel circuit for a given quantum operator, 
we want to be able to use additional qubits or "ancillae" for intermediate steps 
in the computation, equivalent to additional processors in a parallel quantum 
computer. However, to avoid entanglement, we demand that our ancillae start
and end in a pure state $|0\rangle$, so that the desired operator appears as the diagonal
block of the operator performed by the circuit on the subspace where the ancillae 
are zero. 

Then in analogy with NC we propose the following definition: 

Definition 2. Let F be a family of quantum operators, i.e. F(n) is a $2^n \times 2^n$
unitary matrix on n qubits. We say that F(n) is embedded in an operator M with
m ancillae if M is a $2^{m+n} \times 2^{m+n}$ matrix which preserves the subspace where
the ancillae are set to $|0\rangle$, and if M is identical to $F(n) \otimes \mathbf{1}_{2^m}$ when restricted
to this subspace.

Then F is in $\mathrm{QNC}^k$ if, for some constants $c_1$, $c_2$ and j, F(n) can be embedded
in a circuit of depth at most $c_1 \log^k n$, with at most $c_2 n^j$ ancillae. Then
$\mathrm{QNC} = \bigcup_k \mathrm{QNC}^k$, the set of operators parallelizable to polylogarithmic depth with a
polynomial number of ancillae.

To extend this definition from quantum operators to decision problems in the 
classical sense, we have to choose a measurement protocol, and to what extent 
we want errors bounded. We will not explore those issues here. 

We will use the notation in figure 1 for our various gates: the controlled-not
and controlled-U, the symmetric phase shift, and arbitrary diagonal gates.



1 Permutations 

In classical circuits, one can move wires around as much as one likes. In a
quantum computer, it may be more difficult to move a qubit from place to place.
However, we can easily show that we can do arbitrary permutations in constant 
depth: 

Proposition 3. Any permutation of n qubits can be performed in 4 layers of 
controlled-not gates with n ancillae, or in 6 layers with no ancillae. 

Fig. 2. Permuting n qubits in 4 layers using n ancillae.



Proof. The first part is obvious; simply copy the qubits into the ancillae, cancel 
the originals, recopy them from the ancillae in the desired order, and cancel the 
ancillae. This is shown in figure 2. 




Fig. 3. Any cycle, and therefore any permutation, is the composition of two sets of 
disjoint transpositions. 




Fig. 4. Switching two qubits with three controlled-nots. 

Without ancillae, we can use the fact that any permutation can be written 
as the composition of two sets of disjoint transpositions [9]. To see this, first 
decompose it into a product of disjoint cycles, and then note that a cycle is the 
composition of two reflections, as shown in figure 3. Two qubits can be switched 
with 3 layers of controlled-not gates as shown in figure 4, so any permutation 
can be done in 6 layers. □ 
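
As a quick sanity check of the two ingredients of this proof, the following sketch simulates them on classical basis states, which suffices because controlled-not gates merely permute basis states. The function names are our own, and the particular pair of reflections (i -> -i and i -> 1 - i, mod n) is one illustrative choice, not necessarily the one drawn in figure 3.

```python
def cnot(bits, control, target):
    """Apply a controlled-not to a list of classical bits (a basis state)."""
    out = list(bits)
    out[target] ^= out[control]
    return out

def swap_via_cnots(bits, i, j):
    """Exchange bits i and j using three controlled-nots."""
    bits = cnot(bits, i, j)
    bits = cnot(bits, j, i)
    bits = cnot(bits, i, j)
    return bits

def reflection(n, c):
    """The map i -> (c - i) mod n. It is its own inverse, so it is a set of
    disjoint transpositions (plus possibly one or two fixed points)."""
    return [(c - i) % n for i in range(n)]

def compose(p, q):
    """Return the permutation 'apply q first, then p'."""
    return [p[q[i]] for i in range(len(q))]

# Two reflections compose to the cycle i -> i + 1 (mod n).
cycle = compose(reflection(5, 1), reflection(5, 0))
```

Each reflection can be carried out as one round of simultaneous swaps, i.e. 3 layers of controlled-nots, which is where the count of 6 layers comes from.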



2 Fan-out 

To make a shallow parallel circuit, it is often important to fan out one of the 
inputs into multiple copies. The controlled-not gate can be used to copy a qubit 
onto an ancilla in the pure state $|0\rangle$ by making a non-destructive measurement:

$$(\alpha|0\rangle + \beta|1\rangle) \otimes |0\rangle \to \alpha|00\rangle + \beta|11\rangle$$

Note that the final state is not a tensor product of two independent qubits;
the two qubits are completely entangled. Making an unentangled copy requires
non-unitary, and in fact non-linear, processes since

$$(\alpha|0\rangle + \beta|1\rangle) \otimes (\alpha|0\rangle + \beta|1\rangle) = \alpha^2|00\rangle + \alpha\beta(|01\rangle + |10\rangle) + \beta^2|11\rangle$$

has coefficients quadratic in $\alpha$ and $\beta$.
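
The difference between the two states can be seen numerically. The sketch below is only an illustration under our own conventions: a two-qubit state is a list of four amplitudes indexed by 2*q0 + q1, and the helper names are hypothetical.

```python
def tensor(u, v):
    """Tensor product of two state vectors given as lists of amplitudes."""
    return [x * y for x in u for y in v]

def cnot(state):
    """Controlled-not with qubit 0 as input and qubit 1 as target.
    Amplitudes are indexed by 2*q0 + q1, so |10> and |11> are exchanged."""
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]

def fan_out(alpha, beta):
    """Copy (alpha|0> + beta|1>) onto an ancilla in |0> with a controlled-not."""
    return cnot(tensor([alpha, beta], [1, 0]))

def independent_copies(alpha, beta):
    """The product state that an unentangled copy would have to be."""
    return tensor([alpha, beta], [alpha, beta])
```

For alpha = 0.6 and beta = 0.8, `fan_out` has amplitudes only on |00> and |11>, linear in alpha and beta, while `independent_copies` has all four amplitudes non-zero and quadratic in them.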

This means that disentangling or uncopying the ancillae by the end of the
computation, returning them to their initial state $|0\rangle$, is a non-trivial and
important part of a quantum circuit. There are, however, some special cases where
this can be done easily.

Suppose we have a series of n controlled-U gates all with the same input
qubit. Rather than applying them in series, we can fan out the input into n
copies by splitting it $\log_2 n$ times, apply them to the target qubits, and uncopy
them afterward, thus reducing the circuit's depth to $O(\log n)$.

Fig. 5. Parallelizing n controlled gates on a single input qubit q to $O(\log n)$ depth.



Proposition 4. A series of n controlled gates coupling the same input to n
target qubits can be parallelized to $O(\log n)$ depth with $O(n)$ ancillae.

Proof. The circuit in figure 5 copies the input onto $n - 1$ ancillae, applies all the
controlled gates simultaneously, and uncopies the ancillae back to their original
state. Its total depth is $2 \log_2 n + 1$. □
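
The doubling schedule behind this depth count can be sketched as follows. This is our own illustration: wire 0 stands for the input, wires 1 to n - 1 for the ancillae, and the function names are hypothetical. When n is not a power of two the number of copying layers is the ceiling of $\log_2 n$, which is an assumption on our part since the proof only states the power-of-two count.

```python
def fanout_layers(n):
    """Layers of (source, destination) controlled-nots that spread wire 0
    onto wires 1..n-1. Within a layer every wire is used at most once."""
    layers, have = [], 1
    while have < n:
        new = min(have, n - have)          # each existing copy makes one more
        layers.append([(src, have + src) for src in range(new)])
        have += new
    return layers

def total_depth(n):
    """Copy, one layer of controlled gates, then uncopy in reverse order."""
    return 2 * len(fanout_layers(n)) + 1
```

For n = 8 this gives three copying layers of 1, 2 and 4 controlled-nots and a total depth of 7.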

This kind of symmetric circuit, in which we uncopy the ancillae to return 
them to their original state, is similar to circuits designed by the Reversible 
Computation Group at MIT [4] for reversible classical computers. 



3 Diagonal and mutually commuting gates 

Fan-in seems more difficult in general. Classically, if a single qubit receives
controlled gates from n inputs, we can calculate the composition of these in $O(\log n)$
time by composing them in pairs, but it is unclear when we can do this with
unitary linear operators. One special case where it is possible is if all the gates
are diagonal:



Fig. 6. Using entanglement to parallelize diagonal operators. 




Fig. 7. Parallelizing n diagonal gates on a single qubit as in proposition 4. 

Proposition 5. A series of n diagonal gates coupling the same qubit to n others
can be parallelized to $O(\log n)$ depth with $O(n)$ ancillae.

Proof. Here the entanglement between two copies of a qubit becomes an asset.
Since diagonal matrices don't mix Boolean states with each other, we can act
on a qubit and an entangled copy of it with two diagonal matrices $D_1$ and $D_2$
as in figure 6. When we uncopy the ancilla, we have the same effect as if we
had applied both matrices to the original. Then the same kind of circuit as in
proposition 4 works, as shown in figure 7. □
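
The key step of this proof can be checked on a small case. The sketch below simplifies the setting to one-qubit diagonal gates $D_1$ and $D_2$ (the proposition concerns gates that also couple to other qubits); the state layout, with amplitudes indexed by 2*q0 + q1, and all function names are our own assumptions.

```python
import cmath

def cnot(state):
    """Controlled-not, qubit 0 controlling qubit 1; index = 2*q0 + q1."""
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]

def apply_diag(state, qubit, d):
    """Apply the one-qubit diagonal gate diag(d[0], d[1]) to `qubit`."""
    out = []
    for index, amp in enumerate(state):
        bit = (index >> (1 - qubit)) & 1
        out.append(amp * d[bit])
    return out

def parallel_diagonals(alpha, beta, d1, d2):
    state = [alpha, 0, beta, 0]         # (alpha|0> + beta|1>) tensor |0>
    state = cnot(state)                 # copy: alpha|00> + beta|11>
    state = apply_diag(state, 0, d1)    # D1 acts on the original ...
    state = apply_diag(state, 1, d2)    # ... D2 on the copy, in the same layer
    return cnot(state)                  # uncopy the ancilla

def serial_diagonals(alpha, beta, d1, d2):
    """D1 followed by D2 on the original qubit, ancilla left in |0>."""
    return [alpha * d1[0] * d2[0], 0, beta * d1[1] * d2[1], 0]
```

Both functions return the same amplitudes, and the ancilla ends in $|0\rangle$ in each case.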

Since unitary matrices commute if and only if they can be simultaneously
diagonalized, we can generalize this to the case where a set of controlled-U gates
applied to a given target qubit have mutually commuting Us:

Proposition 6. A series of n controlled-U gates acting on a single qubit,
where the Us mutually commute, can be parallelized to $O(\log n)$ depth with $O(n)$
ancillae.

Proof. Since the Us all commute, they can all be diagonalized by the same
unitary operator T. Apply T to the target qubit, parallelize the circuit using
proposition 5, and put the target qubit back in the original basis by applying
$T^{-1}$. This is all done with a circuit of depth $2 \log_2 n + 3$. □

As an example, in figure 8 we show a circuit that applies the qth power of
an operator U to a target qubit, where $0 \le q < 2^k$ is given by k control qubits
as a binary integer. We can do this because $U, U^2, U^4, \ldots$ can be simultaneously
diagonalized, since $U^q = T D^q T^{-1}$. Note that this works for operators U that
act on any number of qubits.

Fig. 8. Applying an operator U q times, where q is given in binary by the control
qubits.



We can extend this to circuits in general whose gates are mutually commuting,
which includes diagonal gates:

Proposition 7. Any circuit consisting of diagonal or mutually commuting gates,
each of which couples at most k qubits, can be parallelized to depth $O(n^{k-1})$ with
no ancillae, and to depth $O(\log n)$ with $O(n^k)$ ancillae. Therefore, any family of
such circuits is in $\mathrm{QNC}^1$.

Proof. Since all the gates commute, we can sort them by which qubits they couple,
and arrive at a compressed circuit with one gate for each k-tuple. This gives
$\binom{n}{k} = O(n^k)$ gates, but by performing groups of n/k disjoint gates
simultaneously we can do all of them in depth $O(n^{k-1})$.

By making $O(n^{k-1})$ copies of each qubit, we can then use propositions 5 and
6 to reduce this further to $O(\log n)$ depth. □

This is hardly surprising; after all, diagonal gates are just conditional phase 
shifts, and saying that two gates commute is almost like saying that they can be 
performed simultaneously. 



4 Circuits of controlled-not gates 



We can also fan in controlled-not gates. Figure 9 shows how to implement n
controlled-not gates on the same target qubit in depth $2 \log_2 n + 1$. The ancillae
carry the intermediate "sums mod 2" of the inputs, and we add them in pairs.

Fig. 9. Parallelizing n controlled-not gates to $O(\log n)$ depth by adding them in pairs.
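
The pairwise addition can be sketched classically as follows. This is only an illustration of the adding-in-pairs idea on bit values; it does not model the uncopying of the ancillae, and the function name is our own.

```python
def parity_tree(bits):
    """Add a non-empty list of bits mod 2 in pairs.
    Returns (parity, number of pairwise layers used)."""
    level, layers = list(bits), 0
    while len(level) > 1:
        # One layer: every adjacent pair is added simultaneously.
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])      # an unpaired sum passes through unchanged
        level, layers = nxt, layers + 1
    return level[0], layers
```

For 8 input bits the tree has 3 layers, consistent with the $\log_2 n$ stages on each side of the final controlled-not in the depth count above.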



We can use a generalization of this circuit to show that any circuit composed 
entirely of controlled-not gates can be parallelized to logarithmic depth: 

Proposition 8. Any circuit on n qubits composed entirely of controlled-not gates
can be parallelized to $O(\log n)$ depth with $O(n^2)$ ancillae. Therefore, any family
of such circuits is in $\mathrm{QNC}^1$.

Proof. First, note that in any circuit of controlled-not gates, if the n input qubits
have binary values and are given by an n-dimensional vector q, then the output
can be written Mq where M is an $n \times n$ matrix over the integers mod 2. Each
of the output qubits can be written as a sum of up to n inputs, $(Mq)_i = \sum_k q_{j_k}$
where the $j_k$ are those j for which $M_{ij} = 1$.

We can break these sums down into binary trees. Let $W_n$ be the complete
output sums, $W_{n/2}$ be their left and right halves consisting of up to n/2 inputs,
and so on down to single inputs. There are fewer than $n^2$ such intermediate sums
$W_k$ with $k > 1$. We assign an ancilla to each one, and build them up from the
inputs in $\log_2 n$ stages, adding pairs from $W_k$ to make $W_{2k}$. The first stage takes
$O(\log n)$ time and an additional $O(n^2)$ ancillae since we may need to add each
input into multiple members of $W_2$, but each stage after that can be done in
depth 2.

To cancel the ancillae, we use the same cascade in reverse order, adding pairs
from $W_k$ to cancel $W_{2k}$. This leaves us with the input q, the output Mq, and
the ancillae set to zero.

Now we use the fact that, since the circuit is unitary, M is invertible. Thus
we can recalculate the input $q = M^{-1}(Mq)$ and cancel it. We use the same
ancillae in reverse order, building the inputs q out of Mq with a series of partial
sums $V_2, V_4, \ldots$, cancel q, and cancel the ancillae in reverse as before. All this is
illustrated in figure 10.

This leaves us with the output Mq and all other qubits zero. With four more 
layers as in proposition 3, we can shift the output back to the input qubits, and 
we're done. □ 
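
The two facts the proof relies on, that a controlled-not circuit acts on basis states as a matrix M over the integers mod 2 and that M is invertible, can be checked with a short sketch. The representation of a circuit as a list of (input, target) pairs and all function names are our own assumptions.

```python
def cnot_circuit_matrix(n, gates):
    """Return M over GF(2), as a list of rows, for a sequence of
    (control, target) controlled-nots applied in order."""
    rows = [[int(i == j) for j in range(n)] for i in range(n)]
    for control, target in gates:
        # The target now carries (old target) + (old control) mod 2.
        rows[target] = [a ^ b for a, b in zip(rows[target], rows[control])]
    return rows

def apply_matrix(m, q):
    """Compute Mq over GF(2)."""
    return [sum(a & b for a, b in zip(row, q)) % 2 for row in m]

def run_circuit(gates, q):
    """Apply the controlled-nots one by one to the bit vector q."""
    bits = list(q)
    for control, target in gates:
        bits[target] ^= bits[control]
    return bits

def is_invertible_mod2(m):
    """Gauss-Jordan elimination over GF(2)."""
    m = [row[:] for row in m]
    n = len(m)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            return False
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col and m[r][col]:
                m[r] = [a ^ b for a, b in zip(m[r], m[col])]
    return True
```

For the three-gate circuit [(0, 1), (1, 2), (2, 0)] on three qubits, `cnot_circuit_matrix` returns the rows [0, 1, 1], [1, 1, 0], [1, 1, 1], and applying this matrix agrees with running the gates on every input.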

Fig. 10. Parallelizing an arbitrary circuit of controlled-not gates to logarithmic depth.



This result is hardly surprising; after all, these circuits are reversible Boolean
circuits, and any classical circuit composed of controlled-not gates is in $\mathrm{NC}^1$.
We just did a little extra work to disentangle the ancillae.



5 Circuits with both diagonal and controlled-not gates 

We have shown that circuits composed of diagonal or controlled-not gates can
be parallelized. Since circuits composed of both these kinds of gates have only
one non-zero element in each row and column, they are really just classical
reversible circuits with phase shifts attached. Therefore, it's reasonable to ask
whether propositions 7 and 8 can be combined; that is, whether arbitrary
circuits composed of phase shifts and controlled-not gates can be parallelized to
logarithmic depth.

In this section, we will show that this is not the case. However, this will not 
help us show that QNC < QP. 

Proposition 9. Any diagonal unitary operator on n qubits can be performed by 
a circuit consisting of an exponential number of controlled-not gates and one- 
qubit diagonal gates and no ancillae. 

Proof. Any diagonal unitary operator on n qubits consists of $2^n$ phase shifts.
If we write the phase angles as a $2^n$-dimensional vector $\omega$,
then the effect of composing two diagonal operators is simply to add these vectors
mod $2\pi$.

For each subset s of the set of qubits, define a vector $\mu_s$ as +1 if the number
of true qubits in s is even, and −1 if it is odd. If s is all the qubits, for instance,
$\mu_{\{1,\ldots,n\}}$ is the aperiodic Morse sequence (+1, −1, −1, +1, ...) when written out
linearly, but it really just means giving the odd and even nodes of the Boolean
n-cube opposite signs.



Fig. 11. A circuit for the phase shift $\theta\mu_s$, i.e. a phase shift of $+\theta$ if the number of true
qubits is even and $-\theta$ if it is odd.

It is easy to see that the $\mu_s$ for all $s \subseteq \{1, \ldots, n\}$ are linearly independent,
and form a basis of $\mathbb{R}^{2^n}$. Moreover, while diagonal gates coupling k qubits can
only perform phase shifts spanned by those $\mu_s$ with $|s| \le k$, the circuit in figure
11 can perform a phase shift proportional to $\mu_s$ for any s (incidentally, in depth
$O(\log |s|)$ with no ancillae). Therefore, a series of $2^n$ such circuits, one for each
subset of $\{1, \ldots, n\}$, can express any diagonal unitary operator. □
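
To make the basis argument concrete, the sketch below expands a vector of phase angles in the $\mu_s$ and rebuilds it. It relies on a standard fact not stated in the proof, namely that the $\mu_s$ are mutually orthogonal with squared norm $2^n$, so each coefficient is a simple average. Subsets and basis states are both encoded as bitmasks, and the function names are our own.

```python
def mu(s, x):
    """+1 if the number of true qubits of basis state x that lie in the
    subset s is even, -1 if it is odd (s and x are bitmasks)."""
    return -1 if bin(s & x).count("1") % 2 else 1

def coefficients(omega):
    """Coefficient of each mu_s in the phase vector omega (length 2**n)."""
    size = len(omega)
    return [sum(mu(s, x) * omega[x] for x in range(size)) / size
            for s in range(size)]

def reconstruct(coeffs):
    """Sum the phase shifts coeffs[s] * mu_s over all subsets s."""
    size = len(coeffs)
    return [sum(coeffs[s] * mu(s, x) for s in range(size))
            for x in range(size)]
```

Each non-zero coefficient corresponds to one circuit of the kind in figure 11, so a generic phase vector on n qubits uses all $2^n$ of them.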

Then we have the following corollary: 

Corollary 10. There are circuits composed of controlled-not gates and one-qubit 
diagonal gates that cannot be parallelized to less than exponential depth with a 
polynomial number of ancillae. 

Proof. Consider setting up a many-to-one correspondence between circuits and
operators. The set of diagonal unitary operators on n qubits has $2^n$ continuous
degrees of freedom, while the set of circuits of depth d and m ancillae has only
$O(d(m + n))$ continuous degrees of freedom (and some discrete ones for the
circuit's topology). Thus if m is polynomial, d must be exponential. □

Note that this counting argument does not help us distinguish QP from
QNC, since both have a polynomial number of degrees of freedom. Neither does
it help us exhibit a particular family of circuits which require exponential depth,
since it is completely non-constructive. The classical situation is similar; there
are $2^{2^n}$ Boolean functions on n variables, but only $2^{O(dw \log w)}$ circuits of depth
d and width w. Thus the vast majority of Boolean functions require exponential
depth if the width is polynomial, but proving a lower bound on the depth of a
particular one remains elusive.



6 QNC ≠ QP? The staircase circuit




Fig. 12. This "staircase" circuit seems hard to parallelize unless the operators are 
purely diagonal or off-diagonal. 



A simple, perhaps minimal, example of a quantum circuit that seems hard
to parallelize is the "staircase" circuit shown in figure 12. This kind of structure
appears in the standard circuit for the quantum Fourier transform, which has
$O(n^2)$ gates [2, 8]. Careful inspection shows that the QFT can in fact be
parallelized to $O(n)$ depth as shown in figure 13 [5], but it seems difficult to do any
better. Clearly, any fast parallel circuit for the QFT would be relevant to prime
factoring and other problems the QFT is used for.

If we define QP as the family of quantum operators that can be expressed 
with circuits of polynomial depth (again, leaving measurement issues aside for 
now), we can make the following conjecture: 

Conjecture 11. Staircase circuits composed of controlled-U gates other than
diagonal or off-diagonal gates (i.e. other than the special cases handled in
propositions 7 and 8) cannot be parallelized to less than linear depth. Therefore,
QNC < QP.



7 Conclusion 

We conclude with some questions for further work. 

Parsing classical context-free languages is in NC. Quantum context-free
languages have been defined in [6]. Is quantum parsing, i.e. producing derivation
trees with the appropriate amplitudes, in QNC?

Can circuits for quantum error correction such as those in [3] be parallelized
to significantly smaller depth? If so, does this reduce the threshold error
necessary for long-term computation, at least as far as storage errors are concerned?

Fig. 13. The standard circuit for the quantum Fourier transform on n qubits can be
carried out in $2n - 1$ layers. Can we do better?

Finally, can the reader show that the staircase circuit cannot be parallelized,
thus showing that QNC < QP? This would be quite significant, since the
corresponding classical question NC < P is still open.